Introduction to R

In this lab, we will introduce the basic functionality of R, together with some simple plotting functions. We will be using the following files for these examples:

  • A dataset of morphological measurements of three species of penguin from the file penguins.csv

More information about this dataset can be found here: https://allisonhorst.github.io/palmerpenguins/index.html

A quick note on formatting

In this and subsequent labs, code that can be entered into R will be highlighted, e.g.:

plot(x, y)

And R output will be formatted with ## at the start of the line. File names will be given in italics and will be available in the ‘data’ directory on the course Canvas site.

Getting started

RStudio layout

The RStudio interface consists of several windows. Start RStudio from the ‘Start’ menu under Windows, and the following window should appear:

RStudio Interface
  • Bottom left: console window (also called command window). Here you can type simple commands after the > prompt and R will then execute your command. This is the most important window, because this is where R actually does stuff.
  • Top left: editor window (also called script window). Collections of commands (scripts) can be edited and saved. If you don’t see this window, you can open it with [File > New > R script]. Typing a command in the editor window is not enough: it has to be sent to the console before R executes it. To run a line from the script window (or the whole script), you can copy and paste it into the console. Alternatively, click [Run] or press CTRL+ENTER to send it to the console.
  • Top right: workspace / history window. In the workspace window you can see which data and values R has in its memory. You can view and edit the values by clicking on them. The history window shows what has been typed before.
  • Bottom right: Files / plots / packages / help window. Here you can open files, view plots (also previous plots), install and load packages or use the help function. You can change the size of the windows by dragging the grey bars between the windows.

Note that you can rearrange the order of these panels, so this may look different on other computers.

Working with R

Much of your time spent with R will involve typing commands at the console, and RStudio has some features that help with this.

  • The up/down arrow keys allow you to cycle back through commands that have been previously entered. You can then modify or reuse these as necessary. The commands can also be accessed through the ‘History’ tab in the top-right panel.
  • The console has ‘tab-completion’, which allows you to enter the first few characters of a string or function name, press ‘Tab’, and RStudio will bring up a list of possible options that match the string you entered. Try typing pri and pressing ‘Tab’ - you should see print as part of the list, and you can click on this, or scroll down to use it from the list.

Workspace

R has a workspace where variables and data are stored as you use it. This is held in the memory of the computer, so if you are working from a file, you will need to read it in to the R workspace, and then work on the values held in memory. This means that you only access files to read in or write out data; the rest of the time you are working on a copy in the workspace.

Working directory

R defines the working directory as the folder in which it is currently working. When you ask R to open a certain file, it will look in the working directory for this file, and when you tell R to save a data file or plot, it will save it in the working directory.

For this class, the labs will assume that you have your files organized according to the following structure:

+-- ugic2024
|   +-- data

To do this, go to your Documents folder, and create a new folder called ugic2024. In this, create a new folder called data (where we will store all the data used today). Once you have done this, download all the files from the training session Google drive and move them to the data folder.

Once you have created these folders, we need to change R’s working directory so that it is pointing to ugic2024. The easiest way to do this is by going to the [Session] menu in RStudio, then [Set Working Directory] > [Choose Directory…]. This will open a file browser that you can use to browse through your computer and find the folder. (If you are using the base version of R, go to [File] > [Change dir…] in Windows, or [Misc] > [Change Working Directory] in Mac.)

You can also change the working directory manually using the setwd() function in the console. To do this, you may need to know the full path to the folder on your computer. If you followed the instructions given above, this should be:

  • On a Windows system: C:/Users/username/Documents/ugic2024
  • On a Mac OSX system: /Users/username/Documents/ugic2024

Where username is your name on the computer. You can also find this path as follows:

  • On a Windows system:
    • Use the File Explorer to select the folder ugic2024
    • Right-click the folder and select ‘Properties’
    • In the pop-up window, the path will be listed under ‘Location’
    • Highlight this, and copy the path
  • On a Mac OSX system:
    • Use the Finder app to select the folder ugic2024
    • Go to the “View” menu then “Show Path Bar”. This will make the full path appear at the bottom of the Finder window
    • Right-click the Path Bar file name
    • Select “Copy as Pathname”

You can now run the setwd() command. Go to the console window in RStudio and enter the following code:

setwd("")

And paste your directory between the quotes. The code should look something like this (but with your actual user name):

setwd("C:/Users/username/Documents/ugic2024/")

Note that the slashes are forward slashes (a path copied from Windows will contain backslashes, which need to be changed), and don’t forget the quotation marks. R is case sensitive, so make sure you write capitals where necessary. To check that you have correctly changed directory, enter the following command, which will show you the current working directory:

getwd()

You can also use relative paths. If your current working directory is ugic2024 and you want to change to data, enter the following code (where ./ refers to the current directory, so ./data is the data folder inside it).

setwd("./data")

If your current working directory is data and you want to change to ugic2024, enter the following code (where ../ refers to the parent directory, one level above the current one).

setwd("../")

Before proceeding with the rest of today’s lab, make sure to change your working directory back to ugic2024.

If this all seems a little foreign to you, don’t worry - there will be plenty of opportunities to practice this over the day. Understanding the directory structure is very important in being able to manage your files both for this training session and any analysis you will do later.
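
The relative-path moves described above can be tried safely in a throwaway folder. The sketch below is only an illustration: it builds a copy of the ugic2024/data structure inside R's temporary directory (an assumption, made so that nothing in your Documents folder is touched), steps down and back up, and then restores the original working directory. The variable names are made up for this example.

```r
# Build a practice copy of the folder structure in a temporary location
root <- file.path(tempdir(), "ugic2024")
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)

old_wd  <- setwd(root)           # setwd() invisibly returns the previous directory
at_root <- basename(getwd())     # name of the folder we are in now

setwd("./data")                  # ./ is the current folder, so this moves down into data
in_data <- basename(getwd())

setwd("../")                     # ../ is the parent folder, so this moves back up
back_up <- basename(getwd())

setwd(old_wd)                    # put the working directory back where it started
c(at_root, in_data, back_up)
```

Saving the value returned by setwd() is a convenient way to return to where you started without typing the full path again.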

R Projects

R Projects offer another way to manage your work in R. These group together a set of scripts and data held within a folder into a single project, detailed in an .Rproj file. We won’t use these here, but if you are planning to use R for a lot of different analyses and work, these are an excellent way to keep things organized.

Using the console

In the console, the ‘>’ is the prompt, and your commands will be entered here. Click on the console window, then enter the following:

2+2
## [1] 4

And press ‘Enter’, and R will tell you, not too surprisingly, that 2+2=4. The spacing is not important: you could equally enter 2 + 2 or 2+ 2 and get the same result. The [1] before the output is a vector index. It refers to the first value in the output vector (here a vector of length 1). We’ll be using this later.

We can equally use standard math functions, for example, to take the natural log or square root of 2:

log(2)
## [1] 0.6931472
sqrt(2)
## [1] 1.414214

Assigning output

So far, these commands have run some calculations and displayed the results in the console. If you need to store any R output for further use, you will need to assign it to a variable. There are two assignment operators in R: <- and =. For simple assignments like these they are interchangeable, and you will see both used in R examples. To store the results of the previous commands:

a = log(2)
b = sqrt(2)

If you now look in the top right corner in the ‘Environment’ window, you should see these two variables appear. These are now held in R’s workspace and can be reused:

a + b
## [1] 2.107361

File input and output

R can use many different file types, but comma separated value (csv) files are most frequently used as the easiest way to transfer between R and Excel. Make sure you have changed your working directory to the ugic2024 folder. Then get a list of csv files in the data folder as follows (note the use of the pattern parameter to get only certain files):

list.files("./data/", pattern = "\\.csv$")  # pattern is a regular expression: file names ending in .csv
## [1] "lsl.csv"        "penguin2.csv"   "penguins.csv"   "WNAclimate.csv"

Let’s read in the data from the Penguin file (penguins.csv, make sure this appeared in the list from the previous command and ask me if you don’t see it). CSV files can be read in using the read.csv() function:

penguin <- read.csv("./data/penguins.csv")

Note that because this file is held in a different folder (data) to your current working directory (ugic2024), you need to provide the relative path (./data).

The first part of this code (penguin <-) tells R to store the data read in from the file in a data frame called penguin. To print out the contents of any object in R, simply type the name of that object at the command prompt.

penguin

Other useful commands are class() to see what data class an object is (a dataframe), and names() to get a list of the column headers. The function str() is probably the most useful, describing the column names and the type of data stored in them.

class(penguin)
names(penguin)
str(penguin)
## 'data.frame':    344 obs. of  8 variables:
##  $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
##  $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : chr  "male" "female" "female" NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

You can write out dataframes to csv files with write.csv(). Here we’ll write the penguin data back out to a new file (normally, we’d only do this after transforming or working with the data, but this is just an example). Once you’ve run this, check the data folder to make sure the new file has been created.

write.csv(penguin, "./data/penguin2.csv", row.names = FALSE)
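
If you want to convince yourself that writing and reading a csv file preserves your data, the following sketch does a round trip with a tiny made-up data frame. The data frame and the temporary file are assumptions for illustration only, so the files in your data folder are not touched.

```r
# A small made-up data frame (not the real penguin data)
toy <- data.frame(species = c("Adelie", "Gentoo"),
                  body_mass_g = c(3750L, 5000L))

# Write to a temporary csv file, then read it back in
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)   # row.names = FALSE avoids an extra column of row numbers
toy2 <- read.csv(tmp)

str(toy2)               # the same two columns come back
all.equal(toy, toy2)    # TRUE if nothing was changed by the round trip
```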

You can also write out the entire contents of R’s workspace to a binary file (*.RData), with the save function. This can be useful if you want to save a snapshot of your current R session with all the variables and objects you are working with. To save the current workspace:

save(list = ls(), file = "dat.RData")

We can now clear the current workspace (check the environment window before and after you do this):

rm(list = ls())

And if we now reload the binary file, you should see that the variables we had previously (penguin, a, b) reappear in the environment.

load(file = "dat.RData")
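
save() can also store just a few named objects rather than the whole workspace. The sketch below is one way this might look, using a temporary file (an assumption, so that dat.RData is left alone) and recreating a and b so that the example runs on its own.

```r
a <- log(2)
b <- sqrt(2)

snapshot <- tempfile(fileext = ".RData")
save(a, b, file = snapshot)    # save only the named objects

rm(a, b)                       # remove them from the workspace
exists("a")                    # FALSE: a is gone

restored <- load(snapshot)     # load() returns the names of the objects it restored
restored
a + b                          # the values are available again
```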

Data manipulation

When we read in the contents of the file, it was stored as a dataframe with the variable name penguin - this represents the whole dataset. Normally however, you will need to work with subsets of the data (e.g. individual columns or sets of rows). There are two common ways to do this, one with base R and one using the tidyverse, that we will explore here. The tidyverse syntax is becoming more prevalent, especially for data science work, but there are still times when understanding the base R approach can be useful, e.g. when working with matrices or arrays.

Base R

If you want to use base R to access subsets of the data frame or individual values, you will need to understand how R indexes data. R uses 1-based indexing, which means that the first value in any set of data will be indexed at 1, the second at 2, etc. This is in contrast to some other languages (e.g. Python) that use 0-based indexing.

Vectors

To show how this works, we’ll first create a vector of random numbers and print the content (note that your values will be different):

x <- rnorm(10)
x
##  [1]  1.37095845 -0.56469817  0.36312841  0.63286260  0.40426832 -0.10612452
##  [7]  1.51152200 -0.09465904  2.01842371 -0.06271410

To access the first value:

x[1]
## [1] 1.370958

Or to access the third:

x[3]
## [1] 0.3631284

You can also use the colon : to slice the data, i.e. extract the data between two indices:

x[1:3]
## [1]  1.3709584 -0.5646982  0.3631284

Or you can use a set of irregular indices by concatenating them together with the c() function:

x[c(3,5,7)]
## [1] 0.3631284 0.4042683 1.5115220

You can also use a negative index with a - symbol. This will extract all values except the one with that index:

x[-1]
## [1] -0.56469817  0.36312841  0.63286260  0.40426832 -0.10612452  1.51152200
## [7] -0.09465904  2.01842371 -0.06271410
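
Because rnorm() gives different values each time, it can help to repeat these indexing patterns on a vector whose contents you know. The short sketch below uses a made-up vector v and adds two patterns that build on the ones above: dropping several values at once, and selecting by a condition.

```r
v <- c(10, 20, 30, 40, 50)   # fixed values so the results are reproducible

v[c(1, 5)]        # first and last values: 10 50
v[-(1:2)]         # drop the first two values: 30 40 50
v[length(v)]      # last value, whatever the length: 50
v[v > 25]         # logical index, keeps values over 25: 30 40 50
```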

Matrices

Matrices in R are indexed by [row,col], so you need to provide both of these to extract subsets of data. Here we’ll make up a small matrix with values from 2 to 24 on a step of 2:

x <- matrix(seq(2, 24, by = 2), nrow = 3, ncol = 4)

To get the very first entry, we index it with [1,1]:

x[1,1]
## [1] 2

And to get the last:

x[3,4]
## [1] 24

If you leave one index blank, R will extract all values for the row or column you did specify:

x[1,] # first row
## [1]  2  8 14 20
x[,4] # fourth column
## [1] 20 22 24

As before, you can use : to extract slices. This will extract all values in the 2nd to 4th columns:

x[, 2:4]
##      [,1] [,2] [,3]
## [1,]    8   14   20
## [2,]   10   16   22
## [3,]   12   18   24

Or the first row for the same columns:

x[1, 2:4]
## [1]  8 14 20
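
A few related details are easy to check on the same matrix. This sketch rebuilds it under the name m (chosen here only so that x is not overwritten) and shows its dimensions, how a single row loses its matrix shape unless drop = FALSE is used, and how byrow = TRUE changes the order in which the values are filled in.

```r
m <- matrix(seq(2, 24, by = 2), nrow = 3, ncol = 4)  # same values as the matrix above

dim(m)                     # number of rows and columns: 3 4
m[2, ]                     # a single row is returned as a plain vector
m[2, , drop = FALSE]       # drop = FALSE keeps it as a 1 x 4 matrix
m[m > 20]                  # logical index over all cells: 22 24

# byrow = TRUE fills across rows instead of down columns
m_row <- matrix(seq(2, 24, by = 2), nrow = 3, ncol = 4, byrow = TRUE)
m_row[1, ]                 # 2 4 6 8
```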

Dataframes

At this point, you may be wondering how all this relates to data frames - which is how the penguin data was stored when it was read in from the csv file earlier. Data frames are similar to 2D matrices as they are composed of rows (representing observations) and columns (representing variables). As a result, you can use the same row/column indexing to access subsets of data:

penguin[ ,4]   # 4th column
penguin[10, ]  # 10th row

And as before, you can access a range of rows and columns using the colon (:):

penguin[ ,1:4]     # Columns 1 to 4
penguin[1:10, ]    # First 10 rows
penguin[1:50,1:2]  # First 50 rows of the first two columns

Dataframes also use a $ notation, which allows you to access individual columns or variables by name. For example, to extract the bill length variable:

names(penguin)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
penguin$bill_length_mm      # Extract single column

The advantage of this is that you can access a column, even if you don’t know the column number. You can also combine this with the vector indexing (see above) to access specific values within a column:

penguin$bill_length_mm[3]      # 3rd element
penguin$bill_length_mm[-3]     # All but 3rd element
penguin$bill_length_mm[1:10]   # First 10 elements

Comparison operators (<, <=, >, >=, ==, !=) can be used to select parts of the data set by value. This is very useful if you only want to analyze part of your dataset:

penguin$bill_length_mm[penguin$bill_length_mm > 40] # All bill lengths over 40 mm
penguin[penguin$bill_length_mm > 40, ]  # All columns for rows with bill length over 40 mm
penguin$bill_length_mm[(penguin$species == 'Adelie')]   # All "Adelie" penguins
penguin[(penguin$species == 'Adelie'), ]    # All columns for "Adelie" penguins

These operators can be combined, so to get all instances of “Adelie” species with bill lengths greater than 40:

penguin[(penguin$species == 'Adelie') & (penguin$bill_length_mm > 40), ]
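
One thing to watch for with this kind of selection is missing values. When the column being tested contains NA, the comparison also returns NA, and base R indexing then returns a row filled with NAs for it. The sketch below illustrates this with a small made-up data frame (not the real penguin data) and shows two common ways of leaving those rows out.

```r
toy <- data.frame(species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
                  bill_length_mm = c(39.1, NA, 46.1, 50.0))

toy$bill_length_mm > 40                  # FALSE NA TRUE TRUE

with_na  <- toy[toy$bill_length_mm > 40, ]          # 3 rows: the NA test gives a row of NAs
no_na    <- toy[which(toy$bill_length_mm > 40), ]   # 2 rows: which() ignores the NA
no_na_2  <- subset(toy, bill_length_mm > 40)        # 2 rows: subset() also drops it

nrow(with_na)
nrow(no_na)
nrow(no_na_2)
```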

Matrices vs dataframes

The main difference is that dataframes can contain different data classes, whereas matrices and vectors can only contain a single class. Compare the matrix we made earlier:

class(x)            
## [1] "matrix" "array"
class(x[, 1])
## [1] "numeric"
class(x[, 2])
## [1] "numeric"

To the content of the dataframe:

class(penguin)          
## [1] "data.frame"
class(penguin$bill_length_mm)       
## [1] "numeric"
class(penguin$species)          
## [1] "character"
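
To see what "a single class" means in practice, the sketch below mixes numbers and text in a vector, a matrix and a data frame. The object names are made up for this example, and the data frame result assumes R 4.0 or later, where text columns stay as character by default.

```r
mixed_vec <- c(1, "a", TRUE)          # everything is coerced to character
class(mixed_vec)                      # "character"

mixed_mat <- cbind(1:2, c("a", "b"))  # numbers become text in a matrix too
class(mixed_mat[, 1])                 # "character"

mixed_df <- data.frame(id = 1:2, label = c("a", "b"))
sapply(mixed_df, class)               # each column keeps its own class
```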

Data manipulation with the tidyverse

The tidyverse is a set of add-on packages for R that are largely designed to streamline working with and visualizing data in R. One of these packages (dplyr) is used to transform and summarize tabular data with rows and columns, like R’s dataframes. It can be used as a replacement for much of the indexing described above. The package contains a set of functions (or “verbs”) that perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data.

As this is an add-on package, you will need to download and install it. You can install the entire set of tidyverse packages with:

install.packages("tidyverse")

Or just dplyr with:

install.packages("dplyr")

Note that you will only need to install this once, but each time you want to use it in a new R session, you will need to load it into memory using the library function:

library(dplyr)

dplyr has a wide variety of operations, but the most important ones are:

Verb         Description
select()     select columns
filter()     filter rows
arrange()    re-order or arrange rows
mutate()     create new columns
summarise()  summarize values
group_by()   allows for group operations

These functions are commonly used with a pipe operator, which chains functions together by taking the output from one function and passing it directly to the next. The %>% pipe is made available automatically when you load dplyr, so you don’t need to load anything else. (Recent versions of R, 4.1 and later, also include a built-in pipe written |>, which behaves similarly for the examples here.) In the first example below, we’ll look at using one of these functions without and then with the pipe, and then use the pipe throughout the rest of this lab.
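
Before using the pipe with dplyr, it may help to see it with ordinary functions. This small sketch assumes R 4.1 or later (needed for the built-in |> pipe) and uses a made-up vector; it computes the same result with nested calls and with a pipe.

```r
vals <- c(1, 4, 9)

nested <- sum(sqrt(vals))            # nested calls, read from the inside out
piped  <- vals |> sqrt() |> sum()    # same steps, read left to right

nested   # 6
piped    # 6
```

The piped version lists the steps in the order they happen, which is the main reason it is popular for longer chains of data manipulation.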

Selecting by column

The select() function allows the selection of columns by name. To use this without the pipe, add the data frame name in the function parentheses. Here we select two columns from the penguin dataframe, and then use the head function to show the first few rows.

spp_yr <- select(penguin, species, year)
head(spp_yr) # Show the first 6 rows
##   species year
## 1  Adelie 2007
## 2  Adelie 2007
## 3  Adelie 2007
## 4  Adelie 2007
## 5  Adelie 2007
## 6  Adelie 2007

We can rewrite this as a single line using the %>% pipe and avoiding the need for the intermediate data frame spp_yr. Note there are two pipes: the first sends the dataframe to the select() function, and the second sends the output of this function to the head() function:

penguin %>% 
  select(species, year) %>% 
  head()
##   species year
## 1  Adelie 2007
## 2  Adelie 2007
## 3  Adelie 2007
## 4  Adelie 2007
## 5  Adelie 2007
## 6  Adelie 2007

For most of these examples, we won’t save the output, but you can easily store this by assigning it to a new variable:

spp_yr <- penguin %>% 
  select(species, year) 

To select all the columns except a specific column, use the “-” (subtraction) operator:

penguin %>% 
  select(-year) %>% 
  head()

To select a range of columns by name, use the “:” (colon) operator (as we did with selecting multiple columns using indices):

penguin %>% 
  select(bill_length_mm:body_mass_g) %>%
  head()

To select all columns that start with the character string “bill”, use the function starts_with():

penguin %>%
  select(starts_with("bill"))

Here are some additional functions to select columns based on specific criteria:

  • ends_with() = Select columns that end with a character string
  • contains() = Select columns that contain a character string
  • matches() = Select columns that match a regular expression
  • one_of() = Select columns whose names are in a group of names (in recent versions of dplyr this is superseded by any_of() and all_of())
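
The helper functions listed above can be tried on a one-row made-up data frame, which makes it easy to see which columns each one picks up. This is a sketch only: the toy data frame is an assumption, and any_of() is used in place of one_of() on the assumption that a recent version of dplyr is installed.

```r
library(dplyr)

toy <- data.frame(species = "Adelie",
                  bill_length_mm = 39.1,
                  bill_depth_mm = 18.7,
                  body_mass_g = 3750)

mm_cols   <- toy %>% select(ends_with("_mm")) %>% names()    # bill_length_mm, bill_depth_mm
mass_cols <- toy %>% select(contains("mass")) %>% names()    # body_mass_g
regex_col <- toy %>% select(matches("^b.*_g$")) %>% names()  # body_mass_g
some_cols <- toy %>% select(any_of(c("species", "sex"))) %>% names()  # species only: "sex" is absent and is skipped

mm_cols
mass_cols
regex_col
some_cols
```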

Filtering by row

The filter() function allows the selection of rows. To filter the data for rows where the body mass is over 4000 grams:

penguin %>% 
  filter(body_mass_g > 4000)

You can add multiple conditions in the filter function. For example, to filter for male penguins with a body mass of over 4000 g:

penguin %>%
  filter(sex == "male", body_mass_g > 4000)

To filter for male penguins with a body mass of over 4000 g that are from Torgersen island:

penguin %>%
  filter(sex == "male", body_mass_g > 4000, island == "Torgersen")

To filter for female penguins from either Biscoe or Torgersen island:

penguin %>%
  filter(sex == "female", island %in% c("Biscoe", "Torgersen"))
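
The different ways of combining conditions in filter() are easier to compare on a few rows where you can check the answer by eye. The sketch below uses a made-up four-row data frame (an assumption for illustration) to contrast conditions separated by commas, which must all hold, with the | operator, where either may hold, and with %in% for membership in a set.

```r
library(dplyr)

toy <- data.frame(sex = c("male", "female", "female", NA),
                  island = c("Biscoe", "Dream", "Torgersen", "Biscoe"),
                  body_mass_g = c(4500, 3900, 4100, 4200))

and_rows <- toy %>% filter(sex == "female", body_mass_g > 4000)   # both must hold
or_rows  <- toy %>% filter(sex == "female" | body_mass_g > 4000)  # either may hold
in_rows  <- toy %>% filter(island %in% c("Biscoe", "Torgersen"))  # membership test

nrow(and_rows)  # 1
nrow(or_rows)   # 4
nrow(in_rows)   # 3
```

Note that filter() only keeps rows where the condition is TRUE, so the row with a missing sex is dropped from the first result rather than being returned as a row of NAs.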

Note that there is also a slice() function, which simply extracts rows according to their position - this is the equivalent to the indexing we have done before:

penguin %>%
  slice(1:10)
##    species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1   Adelie Torgersen           39.1          18.7               181        3750
## 2   Adelie Torgersen           39.5          17.4               186        3800
## 3   Adelie Torgersen           40.3          18.0               195        3250
## 4   Adelie Torgersen             NA            NA                NA          NA
## 5   Adelie Torgersen           36.7          19.3               193        3450
## 6   Adelie Torgersen           39.3          20.6               190        3650
## 7   Adelie Torgersen           38.9          17.8               181        3625
## 8   Adelie Torgersen           39.2          19.6               195        4675
## 9   Adelie Torgersen           34.1          18.1               193        3475
## 10  Adelie Torgersen           42.0          20.2               190        4250
##       sex year
## 1    male 2007
## 2  female 2007
## 3  female 2007
## 4    <NA> 2007
## 5  female 2007
## 6    male 2007
## 7  female 2007
## 8    male 2007
## 9    <NA> 2007
## 10   <NA> 2007

Using multiple functions

The real power of dplyr comes when you start to combine functions. For example here, we’ll extract just the body mass and species name for the penguins from Dream island:

penguin %>%
  filter(island  == "Dream") %>%
  select(species, body_mass_g)

The arrange() function

This function allows us to arrange (or re-order) rows by a particular column. So to arrange by increasing bill length (I’m using the head() function to limit the amount of data that appears on the screen):

penguin %>% 
  arrange(bill_length_mm) %>% 
  head()
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie     Dream           32.1          15.5               188        3050
## 2  Adelie     Dream           33.1          16.1               178        2900
## 3  Adelie Torgersen           33.5          19.0               190        3600
## 4  Adelie     Dream           34.0          17.1               185        3400
## 5  Adelie Torgersen           34.1          18.1               193        3475
## 6  Adelie Torgersen           34.4          18.4               184        3325
##      sex year
## 1 female 2009
## 2 female 2008
## 3 female 2008
## 4 female 2008
## 5   <NA> 2007
## 6 female 2007

And you can include the desc() function to reverse the sort order:

penguin %>% 
  arrange(desc(bill_length_mm)) %>% 
  head()
##     species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1    Gentoo Biscoe           59.6          17.0               230        6050
## 2 Chinstrap  Dream           58.0          17.8               181        3700
## 3    Gentoo Biscoe           55.9          17.0               228        5600
## 4 Chinstrap  Dream           55.8          19.8               207        4000
## 5    Gentoo Biscoe           55.1          16.0               230        5850
## 6    Gentoo Biscoe           54.3          15.7               231        5650
##      sex year
## 1   male 2007
## 2 female 2007
## 3   male 2009
## 4   male 2009
## 5   male 2009
## 6   male 2008

Note that you can add additional variable names to arrange() to sort on consecutive columns.

Now, we will filter for female penguins, select four columns from the dataset, arrange the rows by increasing body mass, then pass this to head() to show the first few rows:

penguin %>% 
  filter(sex == "female") %>%
  select(island, year, body_mass_g, flipper_length_mm) %>%
  arrange(body_mass_g) %>%
  head()

The distinct() function

The function distinct() returns the unique rows of a data frame for the columns you name, so to get the list of islands in the penguin data set:

penguin %>%
  distinct(island)

So, if we want the list of islands that have female penguins with over 4000 g body mass:

penguin %>%
  filter(sex == "female", body_mass_g > 4000) %>%
  distinct(island)
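
distinct() can also take more than one column, in which case it returns the unique combinations. The sketch below shows this on a made-up data frame (an assumption for illustration), along with pull(), which turns the one-column result into a plain vector like the one base R's unique() would give.

```r
library(dplyr)

toy <- data.frame(species = c("Adelie", "Adelie", "Gentoo", "Adelie"),
                  island  = c("Dream", "Dream", "Biscoe", "Biscoe"))

islands <- toy %>% distinct(island)            # 2 rows: Dream, Biscoe
combos  <- toy %>% distinct(species, island)   # 3 rows: unique species/island pairs
as_vec  <- toy %>% distinct(island) %>% pull(island)   # the same islands as a vector

islands
combos
identical(as_vec, unique(toy$island))          # TRUE: matches the base R equivalent
```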

The mutate() function

This function can be used to add new columns to the data frame. We’ll use this to create a new column with the body mass in kilograms:

penguin %>% 
  select(island, body_mass_g) %>%
  mutate(body_mass_kg = body_mass_g / 1000)

And let’s sort this by mass to find the islands with the heaviest penguins:

penguin %>% 
  select(island, body_mass_g) %>%
  mutate(body_mass_kg = body_mass_g / 1000) %>%
  arrange(desc(body_mass_kg)) %>%
  head()

Note that this does not add the column to the original dataframe, unless you assign the output:

penguin <- penguin %>% 
  mutate(body_mass_kg = body_mass_g / 1000)
names(penguin)

We’ll look at a couple of other dplyr functions shortly, but first, we’ll explore what an R function is.

Functions in R

Functions typically consist of the name of the function (e.g. sqrt for taking square roots) and a set of parentheses. The parentheses are used to pass data to the function as well as setting parameters to change the behavior of the function.

sqrt(5)
## [1] 2.236068

Note that we can use the assignment operator to save the output from a function, allowing you to use this in subsequent functions and analyses.

y <- sqrt(5)
round(y)
## [1] 2

To save time and code, functions can be combined:

round(sqrt(5))
## [1] 2

Most functions take a series of arguments that control the way they work. As an example, we’ll look at the seq() function, which produces a series of numbers on a regular step. In its most common form, it takes three arguments: the starting number, the ending number and the step.

seq(from = 0, to = 20, by = 2)
##  [1]  0  2  4  6  8 10 12 14 16 18 20

If you include the argument names, as in this example, the order does not matter. The argument names can be omitted if you keep to the specified order of arguments, so seq(0,20,2) will give you the equivalent results.

To find out what these arguments are, what they are called and what values they take, use the help() function, e.g. help(seq) or just ?seq. This will open a window with the help file for that function. If you do not know the name of a function, there is a search function help.search(), or use the help browser help.start(); browse to packages or use the search engine.
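
The rules for named and unnamed arguments can be checked directly. This sketch calls seq() three ways and confirms the results match; it also uses args(), which prints a function's argument names and defaults as a quick alternative to opening the help page. The variable names are made up for this example.

```r
by_name   <- seq(from = 0, to = 20, by = 2)
reordered <- seq(by = 2, to = 20, from = 0)   # named arguments in any order
by_pos    <- seq(0, 20, 2)                    # unnamed arguments, matched by position

identical(by_name, reordered)   # TRUE
identical(by_name, by_pos)      # TRUE

args(round)                     # shows the argument names and defaults of round()
round(sqrt(5), digits = 2)      # 2.24
```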

Univariate statistics

R has a large number of inbuilt functions. This section is designed to simply introduce you to some basic functions for estimating simple univariate statistics. We’ll start by simply calculating the mean of the bill length values:

mean(penguin$bill_length_mm)
## [1] NA

This returns the value NA, rather than a mean length. So what went wrong? In the original set of data, there are some missing values, also denoted by NA.

penguin$bill_length_mm[1:15]
##  [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6

R’s default for most functions is to not calculate values when there are missing observations. This is really to alert you to the fact that the data are incomplete, and the value you would obtain might be biased. You can overrule this by adding the argument na.rm=TRUE to the following functions. This removes NAs and calculates the value with whatever is leftover.
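
The effect of na.rm can be seen on a short made-up vector with one missing value (the values below are an assumption for illustration, not taken from the penguin file). The sketch also counts the missing values with is.na(), which is a useful check before deciding to drop them.

```r
bills <- c(39.1, 39.5, NA, 36.7)   # made-up values with one missing observation

mean(bills)                  # NA, because one value is missing
sum(is.na(bills))            # 1 missing value
mean(bills, na.rm = TRUE)    # mean of the three remaining values
mean(bills[!is.na(bills)])   # the same thing, done by hand
```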

Functions to describe the central tendency:

mean(penguin$bill_length_mm, na.rm = TRUE)
## [1] 43.92193
median(penguin$bill_length_mm) # NA again, because na.rm = TRUE was left out
## [1] NA

Functions to describe the dispersion (output not shown):

sd(penguin$bill_length_mm, na.rm = TRUE)   # Standard deviation
var(penguin$bill_length_mm, na.rm = TRUE)  # Variance
min(penguin$bill_length_mm, na.rm = TRUE)
max(penguin$bill_length_mm, na.rm = TRUE)
quantile(penguin$bill_length_mm, na.rm = TRUE)

Note that quantile() takes a parameter that allows you to choose the quantile to be calculated, e.g. quantile(penguin$bill_length_mm, c(0.1, 0.9), na.rm = TRUE) will calculate the 10th and 90th percentiles. Try adapting this to calculate the 25th and 75th percentiles.

Some other useful functions:

sum(penguin$bill_length_mm, na.rm = TRUE)
## [1] 15021.3
table(penguin$species)
## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124
summary(penguin$bill_length_mm)

As R is object oriented, functions will adapt to different data types. Compare:

summary(penguin$bill_length_mm) # Summary of numeric vector
summary(penguin$species) # Summary of character vector
summary(penguin) # Summary of data frame

The summarize() function

These functions can be used with the dplyr function summarize() to create summary statistics for a given column in the data frame, for example, finding the mean. To compute the average body mass, use the mean() function with the new column body_mass_kg.

penguin %>% 
    summarise(avg_body_mass = mean(body_mass_kg, na.rm = TRUE))
##   avg_body_mass
## 1      4.201754

You can use most of the functions that calculate summary statistics described above. There are a number of others that are useful here, including n() to get the number of rows (the equivalent of the length of a vector):

penguin %>% 
    summarise(count = n())
##   count
## 1   344

And n_distinct() returns the number of distinct values in a vector.

penguin %>% 
    summarise(count_islands = n_distinct(island),
              count_species = n_distinct(species))
##   count_islands count_species
## 1             3             3

We can then easily combine several of these in a single summarise() call to calculate a range of summary statistics as follows:

penguin %>% 
    summarise(avg_body_mass = mean(body_mass_kg, na.rm = TRUE),
              sd_body_mass = sd(body_mass_kg, na.rm = TRUE),
              min_body_mass = min(body_mass_kg, na.rm = TRUE),
              max_body_mass = max(body_mass_kg, na.rm = TRUE))
##   avg_body_mass sd_body_mass min_body_mass max_body_mass
## 1      4.201754    0.8019545           2.7           6.3

Although this may not seem that useful (relative to using the functions on their own), combining this with a second function (group_by) really expands this use.

The group_by() function

The group_by() function is a very useful addition to these other functions. It is related to the concept of “split-apply-combine”: for many analyses, we want to split the data frame by some variable (e.g. island or year), apply a function to the individual data frames and then combine the output.

Let’s do that: split the penguin data frame by species, calculate summary statistics (as above), then return everything in a new data frame, giving a set of summary statistics for each species.

penguin %>% 
  group_by(species) %>%
    summarise(avg_body_mass = mean(body_mass_kg, na.rm = TRUE),
              sd_body_mass = sd(body_mass_kg, na.rm = TRUE),
              min_body_mass = min(body_mass_kg, na.rm = TRUE),
              max_body_mass = max(body_mass_kg, na.rm = TRUE))
## # A tibble: 3 × 5
##   species   avg_body_mass sd_body_mass min_body_mass max_body_mass
##   <chr>             <dbl>        <dbl>         <dbl>         <dbl>
## 1 Adelie             3.70        0.459          2.85          4.78
## 2 Chinstrap          3.73        0.384          2.7           4.8 
## 3 Gentoo             5.08        0.504          3.95          6.3
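To see what “split-apply-combine” means in practice, the three steps can be written out by hand in base R. This is only an illustrative sketch on a made-up data frame (toy and its values are hypothetical); group_by() and summarise() do the equivalent work for you in one pipeline:

```r
# Hypothetical toy data standing in for the penguin data frame
toy <- data.frame(
  species      = c("A", "A", "B", "B", "B"),
  body_mass_kg = c(3.0, 4.0, 5.0, NA, 6.0)
)

# 1. Split: one data frame per species
pieces <- split(toy, toy$species)

# 2. Apply: compute the summary statistics for each piece
stats <- lapply(pieces, function(d) {
  data.frame(species       = d$species[1],
             avg_body_mass = mean(d$body_mass_kg, na.rm = TRUE),
             max_body_mass = max(d$body_mass_kg, na.rm = TRUE))
})

# 3. Combine: stack the per-species results into one data frame
result <- do.call(rbind, stats)
result
```

The result has one row per species, just like the grouped summarise() output above.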

We can also group by two variables. Here, we’ll just calculate the mean body mass by species and island, and arrange the results by the mean:

penguin %>% 
  group_by(species, island) %>%
    summarise(avg_body_mass = mean(body_mass_kg, na.rm = TRUE)) %>%
  arrange(avg_body_mass)
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
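The message above is informational: after summarising by two grouping variables, the result is still grouped by the first one. If you would rather get an ungrouped result and no message, one option is to set the .groups argument yourself. A minimal sketch on a made-up data frame (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  species      = c("A", "A", "B", "B"),
  island       = c("X", "Y", "X", "X"),
  body_mass_kg = c(3.0, 4.0, 5.0, 6.0)
)

by_both <- toy %>%
  group_by(species, island) %>%
  summarise(avg_body_mass = mean(body_mass_kg, na.rm = TRUE),
            .groups = "drop") %>%   # return an ungrouped data frame
  arrange(avg_body_mass)
by_both
```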

Plotting with R

Base R

We’ll now look quickly at the basic plotting functions in R. R has a wide range of plotting types, and we will look at some more complex methods later in this class. For now, we will concentrate on the basic plotting function (plot()) and how to simply modify this.

Scatter plots

The basic R function for plotting (plot()) will produce a scatter plot of two variables:

plot(penguin$bill_length_mm, penguin$bill_depth_mm)

As we know that these values come from three different species, we can use this knowledge to add extra information to the plot, by using the col parameter. We first convert the species column to a factor. R will then use the levels of the factor to assign colors to the points. We can also change the symbol type using the pch parameter.

penguin$species <- as.factor(penguin$species)

plot(penguin$bill_length_mm, penguin$bill_depth_mm,
     col = penguin$species, pch = 16)

Let’s clean up this plot a little by specifying the axis labels and a title:

plot(penguin$bill_length_mm, penguin$bill_depth_mm,
     col = penguin$species, 
     pch = 16, 
     xlab = "Bill length (mm)", 
     ylab = "Bill depth (mm)",
     main = "Penguin size measurements (Palmer Archipelago)")

We can also add a legend to our plot to explain the different colors and symbols. Unfortunately, R makes you do all the work for this, using the legend() function. Here, we add a legend to the bottom left of the plot, giving the label for each species and the color used (colors 1, 2 and 3 follow the order of the factor levels):

legend("bottomleft",
       legend = c("Adelie","Chinstrap","Gentoo"),
       col = c(1,2,3), 
       pch = 16)
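Typing the labels and colors by hand works, but they can get out of step with the plot if the factor levels change. One possible alternative is to build the legend from the factor itself. The sketch below uses a small made-up data frame (toy and its values are hypothetical) in place of the penguin data:

```r
# Hypothetical toy data standing in for the penguin columns
toy <- data.frame(
  bill_length_mm = c(39, 41, 47, 50, 46, 49),
  bill_depth_mm  = c(18, 19, 18, 20, 14, 15),
  species        = factor(c("Adelie", "Adelie", "Chinstrap",
                            "Chinstrap", "Gentoo", "Gentoo"))
)

plot(toy$bill_length_mm, toy$bill_depth_mm,
     col = toy$species, pch = 16)

# levels() returns the labels in the same order R used for the colors:
# color 1 goes with the first level, color 2 with the second, and so on
species_levels <- levels(toy$species)
legend("bottomleft",
       legend = species_levels,
       col = seq_along(species_levels),
       pch = 16)
```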

Histograms

Histograms are commonly used to visualize the distribution of a set of values. The values are ‘binned’ into a set of classes, and the histogram shows the frequency of occurrences in each bin.

hist(penguin$bill_length_mm)

Bins can be defined with the breaks parameter. This may be set to a single number, in which case R splits the data range into approximately that many bins (the number is treated as a suggestion, and R picks rounded break points), or to a sequence of numbers giving the boundaries between bins. In this latter case, we can make use of the seq() function from earlier.

hist(penguin$bill_length_mm, breaks = 20)

hist(penguin$bill_length_mm, breaks = seq(30, 60, 2.5))
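If you want to check which bins were actually used, hist() can return its results without drawing anything by setting plot = FALSE. The following is a sketch using made-up values (x is hypothetical, not the penguin data):

```r
# Hypothetical values standing in for bill lengths
x <- c(33, 36, 38, 39, 41, 42, 44, 46, 47, 49, 51, 52, 55, 58)

# plot = FALSE returns the histogram object instead of drawing it
h_suggested <- hist(x, breaks = 5, plot = FALSE)
h_exact     <- hist(x, breaks = seq(30, 60, 2.5), plot = FALSE)

h_suggested$breaks   # break points chosen by R; may not give exactly 5 bins
h_exact$breaks       # exactly the sequence we supplied
h_exact$counts       # number of values falling in each bin
```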

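Boxplots are an alternative to histograms, and show information about the data quartiles. Here the box represents the interquartile range (the middle 50% of the data, from the 25th to the 75th percentile), the thick bar is the median, and the ‘whiskers’ extend to the most extreme values lying within 1.5 times the interquartile range of the box. Any values beyond the whiskers are drawn as individual points.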

boxplot(penguin$bill_length_mm)
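The numbers behind a boxplot can be inspected with the boxplot.stats() function. The sketch below uses made-up values (x is hypothetical) that include one unusually large observation, to show how it is separated from the whiskers:

```r
# Hypothetical values, including one unusually large observation
x <- c(32, 35, 38, 40, 42, 44, 46, 48, 50, 75)

b <- boxplot.stats(x)
b$stats   # lower whisker, lower edge of box, median, upper edge of box, upper whisker
b$out     # values beyond the whiskers, drawn as individual points
```

For these values the box runs from 38 to 48 with the median at 43, the whiskers reach 32 and 50, and 75 is reported separately in b$out.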

More usefully, we can look at boxplots across a set of classes.

boxplot(penguin$bill_length_mm ~ penguin$species, 
        ylab = 'Bill length (mm)')

Note that this code uses a tilde (~) between the variable and the set of factors. The tilde is often used to define dependency between two variables, and we will return to this again during the modeling part of this class.

Graphic output

By default, R plots graphics to the screen, but has the ability to save figures in most of the standard graphic formats. In order to do this, you first need to open a file (a graphics device), then run the plotting functions, then close the device. Remember that you need to plot all the layers of a figure before closing the file. The following example plots the penguin bill length data to a pdf file. Alternatives include: png, jpeg, svg, etc.; type help(Devices) for more details.

pdf("penguin_boxplot.pdf")
boxplot(penguin$bill_length_mm ~ penguin$species, 
        ylab = 'Bill length (mm)')
dev.off()
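The same open, plot, close pattern applies when a figure has several layers: everything drawn between pdf() and dev.off() ends up in the file. The sketch below uses made-up data and writes to a temporary directory (x, y and the file name example_plot.pdf are hypothetical); it also sets the page size, which pdf() takes in inches:

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 3, 6, 5)

out_file <- file.path(tempdir(), "example_plot.pdf")  # hypothetical file name

pdf(out_file, width = 6, height = 4)           # open the device; size in inches
plot(x, y, pch = 16, xlab = "x", ylab = "y")   # first layer
abline(lm(y ~ x), col = 2)                     # second layer on the same figure
dev.off()                                      # close the device to write the file

file.exists(out_file)
```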

Alternatively, you can copy-paste directly into Word by going to [Export] -> [Copy to clipboard…] in R Studio’s plotting window.

ggplot2

The ggplot2 package forms part of the tidyverse set of packages, and implements the Grammar of Graphics framework developed by Leland Wilkinson. The idea behind the grammar of graphics is that all plots can be described by a common language, rather than considering them as separate barplots, line charts, etc. What differs is the coordinate system and geometry used to place the data on the page. Using this package requires a little more work than standard plots, but the results are usually worthwhile. If you installed the tidyverse set of packages earlier, you will already have this on your system. If not, install it with install.packages("ggplot2"). Then load the library:

library(ggplot2)

In order to understand how ggplot makes a figure, we need to establish what the fundamental parts are of every data graph. They are:

  1. Aesthetics – these are the roles that the variables play in each graph. A variable may control where points appear, the color or shape of a point, the height of a bar and so on.
  2. Geometries – these are the geometric objects which represent the data: points, lines, bars, …
  3. Statistics – these are the functions which add some interpretation to the data, e.g. best fit line, location of median, etc
  4. Scales – these are legends that show the relationship between variables and different symbols or colors (e.g. circular symbols represent females while squares represent males)
  5. Facets – these are groups in your data which may be used to make multiple graphs, each for one of the groups. For example, faceting by gender would cause the graph to repeat for the two genders.
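To make these five parts concrete, here is a sketch of a single figure that uses one of each, with a comment marking which part each line supplies. The data frame toy and its values are made up for illustration; the rest of this section builds up a similar figure step by step with the penguin data.

```r
library(ggplot2)

# Hypothetical toy data: two groups measured in two years
toy <- data.frame(
  x     = rep(1:6, times = 2),
  y     = c(2, 3, 5, 4, 6, 7, 7, 6, 4, 5, 3, 2),
  group = rep(c("a", "b"), each = 6),
  year  = rep(c(2007, 2008), each = 3, times = 2)
)

p <- ggplot(toy, aes(x = x, y = y, color = group)) +        # aesthetics
  geom_point() +                                             # geometry
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # statistic
  scale_color_manual(values = c(a = "darkorange",
                                b = "purple")) +             # scale
  facet_wrap(~year)                                          # facets
p
```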

When making a ggplot figure, we generally start by creating the base figure. To do this we need to tell the function where the data is coming from, and the base aesthetic (i.e. which variable is x, which is y?). To remake the first scatter plot with ggplot(), we start by doing the following:

ggplot(penguin, aes(x = bill_length_mm, y=bill_depth_mm))

This creates the base plot with correct axes, but as you’ll see, there is no data on it. This is because we have not specified the geometry - the way in which we want the data to be plotted. ggplot2 works as a series of layers, so we can add (+) the geometry to this base as follows:

ggplot(penguin, aes(x = bill_length_mm, y=bill_depth_mm)) +
  geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

We can specify symbols and colors for the points in the geom_ function:

ggplot(penguin, aes(x = bill_length_mm, y=bill_depth_mm)) +
  geom_point(color = 'darkorange', shape = 4, size = 4)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

More usefully, we can use any of the categorical variables to set the color. To do this, we need to use them when creating the aesthetic, so that different colors will be used for each class. This also adds a legend directly.

ggplot(penguin, aes(x = bill_length_mm, 
                    y = bill_depth_mm,
                    col = species)) +
  geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

You can also add a shape aesthetic using another categorical variable (here we’ll look at the different sexes):

ggplot(penguin, aes(x = bill_length_mm, 
                    y = bill_depth_mm,
                    col = species,
                    shape = sex)) +
  geom_point()
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).

Note that there are some missing values in the sex column. This allows us to demonstrate another neat aspect of ggplot2 - it can be combined with dplyr for preprocessing. Here, we’ll use the filter() function and is.na() to remove the missing values and pipe the output directly to ggplot:

penguin %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(x = bill_length_mm, 
             y = bill_depth_mm,
             col = species,
             shape = sex)) +
  geom_point()

We can next add scales to set the x and y label. These functions can do quite a bit more, e.g. logarithmic scaling or scaling colors. We’ll also set a theme to remove the default gray background (there’s a few of these, and several more that can be added with the ggthemes package).

penguin %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(x = bill_length_mm, 
             y = bill_depth_mm,
             col = species,
             shape = sex)) +
  geom_point() +
  scale_x_continuous("Bill Length (mm)") +
  scale_y_continuous("Bill Depth (mm)") +
  theme_bw()
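As a small illustration of what else the scale functions can do, the sketch below puts the x axis on a logarithmic scale and picks the colors by hand. The data frame toy and its values are made up, and the color choices are arbitrary:

```r
library(ggplot2)

# Hypothetical toy data spanning several orders of magnitude in x
toy <- data.frame(
  x     = c(1, 10, 100, 1000, 2, 20, 200, 2000),
  y     = c(1, 2, 3, 4, 2, 3, 4, 5),
  group = rep(c("a", "b"), each = 4)
)

p_scaled <- ggplot(toy, aes(x = x, y = y, col = group)) +
  geom_point() +
  scale_x_log10(name = "x (log scale)") +            # logarithmic x axis
  scale_color_manual(name = "Group",                 # legend title
                     values = c(a = "darkorange",
                                b = "purple")) +     # one color per group
  theme_bw()
p_scaled
```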

We’ll just make one more adjustment to this figure. We’ll add a facet layer. This splits the original plot into a set of small multiples, based on another categorical variable (here we’ll use the year).

penguin %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(x = bill_length_mm, 
             y = bill_depth_mm,
             col = species,
             shape = sex)) +
  geom_point() +
  scale_x_continuous("Bill Length (mm)") +
  scale_y_continuous("Bill Depth (mm)") +
  theme_bw() +
  facet_wrap(~year) 

In addition to the points, ggplot2 has several other geometries for lines, barplots, histograms, polygons, and even maps (we’ll look at this later). Here’s just a few examples:

  • Histograms
ggplot(penguin, aes(x = bill_length_mm)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • Histograms by class (species) with transparency (alpha)

ggplot(penguin, aes(x = bill_length_mm, fill = species)) +
  geom_histogram(position = 'identity', alpha = 0.7)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • Boxplots

ggplot(penguin, aes(x = species, y = bill_length_mm)) +
  geom_boxplot()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

  • Boxplots by class
penguin %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(x = species, y = bill_length_mm, fill = sex)) +
  geom_boxplot()

  • Barplots. The default is stacked bars; add position = 'dodge' to the geom_bar() function to change to side-by-side bars
penguin %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(x = island, fill = species)) +
  geom_bar()

  • Barplots. You can change to horizontal bars by flipping the axes:
penguin %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(x = island, fill = species)) +
  geom_bar(position = 'dodge') + 
  coord_flip()

Making interactive plots

One great advantage of the RStudio interface is that it has made it easier to make interactive plots with R. One library that allows you to do this is plotly, an interface to the well-known plotly API. This has a set of functions for building quite complex data visualizations, but also comes with a helper function ggplotly() that makes any ggplot2 figure interactive.

Here, we’ll remake the scatterplot we made earlier, but rather than showing it on the screen, we’ll save it to a gg object called p1:

p1 <- penguin %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(x = bill_length_mm, 
             y = bill_depth_mm,
             col = species,
             shape = sex)) +
  geom_point() +
  scale_x_continuous("Bill Length (mm)") +
  scale_y_continuous("Bill Depth (mm)") +
  theme_bw()

This object holds all of the information to make the figure. If you just type p1 in the console, you’ll see this appear in the plot window. Instead, we’ll load the plotly library (you’ll need to install this with install.packages("plotly")), and then run the ggplotly() function with the figure object:

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(p1)

While the figure will look largely the same, you should see that it is in the ‘Viewer’ tab not the plot tab. If you hover your cursor over the plot, you’ll see labels appear on the closest point, and there will be a menu at the top that allows you to zoom in and out. While useful for exploring data, these figures can also be embedded into html pages (as here) for public visualization.

Quitting R

When you are finished with R, exit by typing q() in the console (or going to [File] > [Quit R Studio]). You will be asked if you want to save your workspace. This is generally a good idea, as this will create a file containing all your current data (“.RData”), and the history (“.Rhistory”) of the commands you have used. If you restart R in the same directory, by clicking on an R script file, the workspace will be loaded automatically. If it isn’t, you can load it by changing to the correct working directory and typing:

load(".RData")
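The same mechanism can be used for individual objects rather than the whole workspace, using save() and load(). The following is a minimal sketch; the object my_values and the file name example.RData are made up for illustration, and the file is written to a temporary directory:

```r
# Hypothetical object and file name, for illustration only
my_values  <- c(a = 1.5, b = 2.5)
rdata_file <- file.path(tempdir(), "example.RData")

save(my_values, file = rdata_file)  # write the named object to disk
rm(my_values)                       # remove it from the workspace

load(rdata_file)                    # restore it under its original name
my_values
```

Note that load() restores objects under the names they were saved with, so it will overwrite any object of the same name already in your workspace.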

Where to get help

Working in R can be frustrating, with errors and warnings popping up from you know not where. In fact, problems arise so often, troubleshooting (which usually just means Googling) should be considered an inescapable component of programming.



Here are some of the best places to look for help: